Scaling laws and fluctuations in the statistics of word frequencies
Abstract
In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results on three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles).
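The homogeneous baseline mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or code: it assumes a finite Zipf distribution with a hypothetical exponent `alpha` and vocabulary of `n_types` words, draws every token independently from it (so there is no topic-like variation between documents), and estimates the mean and variance of the vocabulary size over repeated synthetic documents. All function and parameter names are illustrative.

```python
import random


def zipf_weights(n_types, alpha=1.0):
    """Normalized Zipf probabilities p_r proportional to r**(-alpha), r = 1..n_types."""
    raw = [rank ** -alpha for rank in range(1, n_types + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def vocabulary_size(n_tokens, weights, rng):
    """Number of distinct word types among n_tokens independent draws."""
    tokens = rng.choices(range(len(weights)), weights=weights, k=n_tokens)
    return len(set(tokens))


def vocabulary_stats(n_tokens, weights, n_docs=50, seed=0):
    """Sample mean and sample variance of vocabulary size over n_docs documents."""
    rng = random.Random(seed)
    sizes = [vocabulary_size(n_tokens, weights, rng) for _ in range(n_docs)]
    mean = sum(sizes) / n_docs
    var = sum((s - mean) ** 2 for s in sizes) / (n_docs - 1)
    return mean, var


if __name__ == "__main__":
    # Assumed illustrative parameters, not values taken from the paper.
    weights = zipf_weights(n_types=5000, alpha=1.0)
    for n_tokens in (500, 1000, 2000, 4000):
        mean, var = vocabulary_stats(n_tokens, weights, n_docs=20)
        print(n_tokens, round(mean, 1), round(var, 1))
```

In this homogeneous setting the mean vocabulary size grows more slowly than the number of tokens, which is the qualitative content of Heaps' law. Reproducing the reduced mean and enlarged variance described in the abstract would additionally require letting each word's frequency fluctuate from document to document, which this sketch deliberately leaves out.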
Journal: CoRR
Volume: abs/1406.4441
Publication date: 2014